December 1, 2023
Print list of packages and cite them via Pandoc citation.
Variables (e.g., characteristics), units (e.g., persons), and data (e.g., measurements) are often presented in matrix form. A matrix is a rectangular array of \(n \cdot p\) quantities arranged in \(n\) rows and \(p\) columns:
\[ \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix} \]
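As a minimal sketch of this layout in R, the following creates a small matrix with units in rows and variables in columns. The values and the names (X, person1, height, and so on) are made up for illustration and do not come from the example data.

```r
# Hypothetical toy data: n = 3 units (rows), p = 2 variables (columns)
X <- matrix(
  c(170, 65,
    182, 80,
    165, 58),
  nrow = 3, ncol = 2, byrow = TRUE,
  dimnames = list(c("person1", "person2", "person3"),
                  c("height", "weight"))
)

dim(X)    # number of rows (n) and columns (p)
X[2, 2]   # element in row 2, column 2, i.e. X_22
```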
The mean (or arithmetic mean, average) is the sum of a collection of numbers divided by the count of numbers in the collection. The formula is given in Equation 1.
\[\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i=\frac{x_1+x_2+\dots+x_n}{n} \qquad(1)\]
For example, consider a vector of numbers: \(x = 1, 2, 5, 3, 8\)
\[\bar{x} = \frac{(1+2+5+3+8)}{5}=3.8\]
If the underlying data is a sample (i.e., a subset of a population), it is called the sample mean.
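In R, the calculation above might look as follows. The object name exVec is an assumption (chosen to match the later exVec2); the values are those of the example vector.

```r
# Example vector from the text (the name exVec is assumed)
exVec <- c(1, 2, 5, 3, 8)

# Mean "by hand": sum of the values divided by their count
sum(exVec) / length(exVec)

# Built-in function
mean(exVec)
```

Both calls should return 3.8, in line with the manual calculation.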
If there are missing values (denoted by NA in R), mean() returns NA unless the argument na.rm is set to TRUE. To demonstrate this, we create another example vector (exVec2).
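A minimal sketch of the na.rm behaviour is given below. The name exVec2 is taken from the text, but its values are an assumption made here for illustration.

```r
# Assumed values: the earlier example vector with one missing value
exVec2 <- c(1, 2, NA, 3, 8)

mean(exVec2)                # NA, because one value is missing
mean(exVec2, na.rm = TRUE)  # mean of the four observed values
```

With these assumed values, the second call returns 3.5, that is \((1+2+3+8)/4\).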
The median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as “the middle” value. The formulas are given in Equation 2.
\[ Mdn = \widetilde{x} = \begin{cases} x_{(n+1)/2} & \:\: \text{if } n \text{ is odd} \\ (x_{n/2} + x_{(n/2)+1}) / 2 & \:\: \text{if } n \text{ is even} \end{cases} \qquad(2)\]
Consider again the vector of numbers \(x = 1, 2, 5, 3, 8\) with length \(n = 5\). To calculate the median, first order the vector, \(x = 1, 2, 3, 5, 8\), and then apply the corresponding formula (odd vs. even; here odd):
\[\widetilde{x}=x_{\frac{(5+1)}{2}}=x_3 = 3\]
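The two cases of Equation 2 can be sketched in R as follows. The helper manualMedian() and the object name exVec are hypothetical names introduced here; median() is the built-in function.

```r
# Hypothetical helper implementing Equation 2
manualMedian <- function(x) {
  xSorted <- sort(x)   # order the values first
  n <- length(xSorted)
  if (n %% 2 == 1) {
    xSorted[(n + 1) / 2]                       # odd n: middle value
  } else {
    (xSorted[n / 2] + xSorted[n / 2 + 1]) / 2  # even n: mean of the two middle values
  }
}

exVec <- c(1, 2, 5, 3, 8)
manualMedian(exVec)  # odd case
median(exVec)        # built-in function, same result

manualMedian(c(exVec, 10))  # even case: (3 + 5) / 2
```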
The variance is the expectation of the squared deviation of a random variable from its mean. A distinction is usually made between the population variance and the sample variance. The formula of the population variance is given in Equation 3.
\[VAR(X) = \sigma^2 = \frac{1}{N} \sum\limits_{i=1}^N (x_i - \mu)^2 \qquad(3)\]
The formula of the sample variance is given in Equation 4.
\[ VAR(X) = s^2 = \frac{1}{n-1} \sum\limits_{i=1}^n (x_i - \bar{x})^2 \qquad(4)\]
Using again the vector \(x = 1, 2, 5, 3, 8\), the sample variance is calculated as follows:
\[VAR(X) = s^2 = \frac{1}{4}((1-3.8)^2 + (2-3.8)^2 + (5-3.8)^2 + (3-3.8)^2 + (8-3.8)^2) = 7.7\]
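The following sketch reproduces this calculation in R and contrasts it with the population formula from Equation 3. The object names (exVec, sampleVar, popVar) are assumptions; note that var() uses the sample formula with \(n-1\) in the denominator.

```r
exVec <- c(1, 2, 5, 3, 8)
n <- length(exVec)

# Sum of squared deviations from the mean
ss <- sum((exVec - mean(exVec))^2)

sampleVar <- ss / (n - 1)  # Equation 4 (sample variance)
popVar    <- ss / n        # Equation 3, treating exVec as the whole population

sampleVar
var(exVec)  # built-in function, matches the sample variance
popVar
```

With this vector, the sample variance is 7.7 and the population version is 6.16.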
The standard deviation is defined as the square root of the variance. Again, a distinction is made between the population and the sample standard deviation. The formula of the population standard deviation is given in Equation 5.
\[SD(X) = \sigma = \sqrt{\sigma^2} \qquad(5)\]
The formula of the sample standard deviation is given in Equation 6.
\[SD(X) = s = \sqrt{s^2} \qquad(6)\]
Recall the variance calculation from the previous slide: the (sample) variance of the vector is \(7.7\).
\[SD(X) = \sqrt{7.7}=2.774887\]
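In R, this step might be written as below. The object name exVec is an assumption; sd() computes the sample standard deviation, i.e. the square root of var().

```r
exVec <- c(1, 2, 5, 3, 8)

sqrt(var(exVec))  # square root of the sample variance
sd(exVec)         # built-in function, same result
```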
The range of a vector is the difference between the largest (maximum) and the smallest (minimum) values/observations.
\[Range(x) = R = x_{max}-x_{min} \qquad(7)\]
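A short sketch of Equation 7 in R, using the earlier example vector (the name exVec is assumed). Note that R's range() function returns the minimum and the maximum rather than their difference, so the difference has to be taken explicitly.

```r
exVec <- c(1, 2, 5, 3, 8)

max(exVec) - min(exVec)  # range as defined in Equation 7
range(exVec)             # returns c(min, max), not the difference
diff(range(exVec))       # difference between maximum and minimum
```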
Recall that the dataset dat is the HSB dataset from the merTools package:
Calculating the mean, standard deviation, minimum and maximum for a set of variables:
Collect the variables of interest in a vector using the c() function.
Use the apply() function to apply one or more functions to the data (here: 4 columns).
MARGIN = 2 indicates that the function should be applied over columns.
Calculate mean(), sd(), min() and max().
Define the R object that is returned later (here: the vector ret).
Print the results…
       mathach    female           ses      size
[1,] 12.747853 0.5281837  0.0001433542 1056.8618
[2,]  6.878246 0.4992398  0.7793551951  604.1725
[3,] -2.832000 0.0000000 -3.7580000000  100.0000
[4,] 24.993000 1.0000000  2.6920000000 2713.0000
This is a weird format; variables should be in rows not columns. Transpose…
Better, but still not really convincing…
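The apply() and t() steps described above can be sketched on a small toy data frame. Everything here (toyDat, toyVar, toyDescr, and the values) is hypothetical and stands in for the HSB data, so the numbers will not match the output shown above.

```r
# Hypothetical toy data (not the HSB data)
toyDat <- data.frame(
  score = c(10, 12, 15, 9),
  age   = c(20, 25, 22, 21)
)

# Variables of interest
toyVar <- c("score", "age")

# Apply several functions to each column (MARGIN = 2)
toyDescr <- apply(
  X = toyDat[, toyVar],
  MARGIN = 2,
  FUN = function(x) {
    ret <- c(mean(x), sd(x), min(x), max(x))
    return(ret)
  }
)
toyDescr  # statistics in rows, variables in columns

# Transpose so that variables are in rows, then label the columns
toyDescrT <- t(toyDescr)
colnames(toyDescrT) <- c("Mean", "SD", "Min", "Max")
toyDescrT
```

Labelling the columns after transposing is one simple way to make the matrix easier to read before it is turned into a formatted table.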
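library(flextable)

exDescrTab <- exDescr |>
  # transpose and coerce to a data.frame
  t() |>
  as.data.frame() |>
  # add the variable names as the first column
  (\(d) cbind(names(myVar), d))() |>
  # create the flextable object and apply the APA theme
  flextable() |>
  theme_apa() |>
  # set the header labels
  set_header_labels(
    "names(myVar)" = "Variables",
    V1 = "Mean",
    V2 = "SD",
    V3 = "Min",
    V4 = "Max") |>
  # center the body, left-align the first column
  align(part = "body", align = "center") |>
  align(j = 1, part = "all", align = "left") |>
  # add a footnote and align it to the left
  add_footer_lines(
    as_paragraph(as_i("Note. "),
                 "This is a footnote.")
  ) |>
  align(align = "left", part = "footer") |>
  # set the column widths
  width(j = 1, width = 2, unit = "in") |>
  width(j = 2:5, width = 1, unit = "in")

Take the results (exDescr object)…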
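Transpose it (i.e., using the t() function) and coerce it to a data.frame object (as.data.frame()).
Add (using the cbind() function) the variable names as the first column of the dataset.
Create the table with the flextable() function.
Apply the APA theme (theme_apa()).
Set the header labels (set_header_labels()).
Align the columns (align()).
Add a footnote (add_footer_lines()) and align it to the left.
Set the column widths (width()) to 2 inches for the first column and 1 inch for the remaining columns.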
Print the table.
If you want to export the table…
psych package
Alternatively, it is convenient to use additional R packages such as the psych package (Revelle, 2023) to calculate descriptive statistics.
Here we use the describe function (with the fast argument set to TRUE) to calculate the descriptive statistics of all variables within the example data set.
| vars | n | mean | sd | median | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7,185 | 0.2747390397 | 0.4464137 | 0.000 | 0.000 | 1.000 | 1.000 | 1.00906215 | -0.9819302 | 0.005266525 |
| 2 | 7,185 | 0.5281837161 | 0.4992398 | 1.000 | 0.000 | 1.000 | 1.000 | -0.11289082 | -1.9875322 | 0.005889736 |
| 3 | 7,185 | 0.0001433542 | 0.7793552 | 0.002 | -3.758 | 2.692 | 6.450 | -0.22809706 | -0.3804498 | 0.009194372 |
| 4 | 7,185 | 12.7478526096 | 6.8782457 | 13.131 | -2.832 | 24.993 | 27.825 | -0.18049184 | -0.9215987 | 0.081145473 |
| 5 | 7,185 | 1,056.8617954071 | 604.1724993 | 1,016.000 | 100.000 | 2,713.000 | 2,613.000 | 0.57149608 | -0.3649453 | 7.127669715 |
| 6 | 7,185 | 0.4931106472 | 0.4999873 | 0.000 | 0.000 | 1.000 | 1.000 | 0.02755427 | -1.9995190 | 0.005898555 |
| 7 | 7,185 | 0.0061384830 | 0.4135539 | 0.038 | -1.188 | 0.831 | 2.019 | -0.26812150 | -0.4797392 | 0.004878864 |
Style the table according to your ideas/demands.